Peking University -- VisualMen

VAST 2009 Challenge
Challenge 1: Badge and Network Traffic

Authors and Affiliations:

Hanqi Guo, Peking University, guohanqi@gmail.com

Jie Liu, Peking University, jieliunju@gmail.com

Xiaoru Yuan, Peking University, xiaoru.yuan@gmail.com

Tool(s):

We developed a toolkit designed for this task, consisting of two general components: QVisualizer and DestIPTree.

 

QVisualizer organizes and visualizes the prox and network traffic data by time stamp. One person’s information for a particular day is shown as small colored stripes in a single row. The data can be arranged to show one person’s activities over the whole month (person-view mode) or everyone’s activities on a specified day (day-view mode). Filtering operations are also implemented to facilitate data exploration and analysis.

 

DestIPTree uses the Treemap technique to analyze the relationship between network traffic and each computer’s behavior. All destination IP addresses are organized into a 4-level tree, one level per byte of the address. The size of each treemap cell is determined by the visit count of the corresponding address, and its color by the amount of data uploaded to that address. Finally, the network records of each computer are overlaid on the IP treemap.
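The tree behind such a treemap can be sketched as follows. This is only an illustrative sketch, not the DestIPTree implementation: the `Node` class, the `build_ip_tree` function, and the `(dest_ip, req_size)` record layout are assumptions made for the example.

```python
class Node:
    """One level of the destination-IP tree (hypothetical structure)."""

    def __init__(self):
        self.children = {}   # next address byte -> Node
        self.visits = 0      # visit count, would drive the cell size
        self.uploaded = 0    # total uploaded size, would drive the cell color


def build_ip_tree(records):
    """Build a 4-level tree from (dest_ip, req_size) pairs.

    The record layout is an assumption; each level below the root
    corresponds to one byte of the dotted IPv4 address.
    """
    root = Node()
    for dest_ip, req_size in records:
        node = root
        node.visits += 1
        node.uploaded += req_size
        for byte in dest_ip.split("."):
            node = node.children.setdefault(byte, Node())
            node.visits += 1
            node.uploaded += req_size
    return root
```

Aggregating the counts at every level means a treemap layout can size and color both the leaf cells and their enclosing groups from the same tree.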

 

QVisualizer and DestIPTree are linked together. We use QVisualizer as the main tool for discovering abnormal behavior of employees and computers, and DestIPTree for finding network traffic patterns.

 

Video:

 

       Video for challenge one

 

ANSWERS:


MC1.1: Identify which computer(s) the employee most likely used to send information to his contact in a tab-delimited table which contains for each computer identified: when the information was sent, how much information was sent and where that information was sent. 

Please link to the file here.

 


MC1.2:  Characterize the patterns of behavior of suspicious computer use.

1.2.1 Problem Analysis and Data Preprocessing

Statistics over all data records show 115414 network access operations and 20243 visited IP addresses in total, of which 19221 are visited only once. The port can help to filter addresses, since port 80 is usually used by web servers and port 25 by mail servers. Port 8080 is less clear-cut; one of its uses is for proxies.

 

We sorted the datasets by different fields and found that all visited-once IP addresses use port 80. The uploaded data size ranges from 100 to 13687307, and the downloaded data size ranges from 2045 to 10000000.
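Counts of this kind can be reproduced with a few lines of code. The sketch below is illustrative only; the function name `destination_stats` and the `(dest_ip, port)` record layout are assumptions, not the format of the challenge files.

```python
from collections import Counter


def destination_stats(records):
    """Summarize destination addresses from (dest_ip, port) pairs.

    Returns the number of distinct addresses, the number of addresses
    visited exactly once, and the set of ports used by those
    visited-once addresses. The record layout is assumed.
    """
    records = list(records)
    visits = Counter(ip for ip, _ in records)
    once = {ip for ip, count in visits.items() if count == 1}
    ports_of_once = {port for ip, port in records if ip in once}
    return len(visits), len(once), ports_of_once
```

If the returned port set contains only 80, every visited-once address was reached on the web port, which is the observation made above.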

 

1.2.2 Tool Design

The raw data come in two types: the prox data and the IP log data. We implemented a tool with two windows to display and compare them; the prox events and the network accesses are each placed on a timeline in their respective window. According to the background information, employees are supposed to use their own machines, so we assume that each employee ID can be matched to an IP address. Comparing the two sources with our tool further supports this assumption, so we can combine the two sources of data. The two windows can also be merged into one, which is the basic concept of QVisualizer.

 

Furthermore, we attempted to discover patterns in the network accesses. Faced with a large number of IP addresses, we initially viewed the IP addresses and user IDs in a matrix. The IP addresses were sorted by date and time, and a level-of-detail (LOD) scheme was introduced to aid observation. However, this view did not yield much useful information. To avoid its disadvantages, we turned to the Treemap technique. The user IDs are represented as spots overlaid on the Treemap, so that their relationships are easy to see (Figure 1).

 

Figure 1 Overview of our program

 

1.2.3 Data Exploration

We observed the data features with our tools and soon found some abnormal events that violate our “one user, one machine” rule. This made it easy to design an automatic exception detection scheme based on the rule. If an employee is in the classified room, then in theory his computer should not upload any data through the internet. So if there are upload records during such a period, we consider them exceptions. QVisualizer reports 8 records of this kind (Figure 2).
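The rule can be expressed compactly in code. The following is a minimal sketch under stated assumptions, not the detection code inside QVisualizer: it assumes the prox events have already been reduced to per-employee intervals spent in the classified room, and that each traffic record carries the ID of the machine's owner. The names `in_classified` and `find_exceptions` are hypothetical.

```python
from datetime import datetime


def in_classified(intervals, t):
    """True if time t falls inside any (start, end) interval."""
    return any(start <= t <= end for start, end in intervals)


def find_exceptions(classified, traffic):
    """Return traffic records sent while the machine's owner was in the classified room.

    classified: dict mapping employee ID -> list of (start, end) datetimes
    traffic:    list of (time, employee ID, dest_ip, req_size) tuples
    Both layouts are assumptions made for this sketch.
    """
    return [
        rec for rec in traffic
        if in_classified(classified.get(rec[1], []), rec[0])
    ]
```

Each returned record is an upload that the "one user, one machine" rule cannot explain, so it is a candidate for manual inspection.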

 

Figure 2 Auto-detection of abnormal events

 

We also checked the datasets manually in QVisualizer, whose flexible display modes and interactions allow us to view all datasets quickly. For instance, employee #13 usually comes to work at around 10:00am, but on Jan. 22nd his computer uploaded a large amount of data at 8:50am, although the log system indicates that he came to the office as usual, at 10:00am. We compared this case with the previous 8 records by printing all their information and discovered common features: all these suspicious network accesses point to the same IP address (100.59.151.133), and their request-response ratios are also quite large. These are probably the patterns we are looking for (Figure 3).

 

Figure 3 Detection of the abnormal activity of employee #13

 

We used the brush tool of QVisualizer to trace all network access records with the same destination IP and found 9 other records. They do not show obviously exceptional activity, but some patterns still emerge. All of these accesses uploaded a large amount of data and appear as single symbols on QVisualizer’s timeline: active network accesses happened long before or after the single event. For example, at 15:15 on Jan. 17th employee #18 uploaded 12398 bytes of data, and more than two hours later, at 17:57, he accessed the suspicious IP address and uploaded 5873546 bytes of data. What’s more, there is no other activity record for employee #18 after 17:20 in all 30 days (Figure 4).
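A rough programmatic counterpart of this brushing step is sketched below. It is an illustration only: the function name `isolated_uploads`, the record layout, and the 30-minute `min_gap` threshold are assumptions chosen for the example, not values taken from our tools.

```python
from datetime import timedelta


def isolated_uploads(traffic, dest_ip, min_gap=timedelta(minutes=30)):
    """Find accesses to dest_ip that stand alone on a machine's timeline.

    traffic: list of (time, employee ID, dest_ip, req_size) tuples (assumed layout)
    min_gap: hypothetical threshold for calling an access "isolated"

    Returns (time, employee ID, req_size, gap) tuples, where gap is the
    time to the nearest other record of the same machine, or None if
    the machine has no other record.
    """
    hits = []
    for i, (t, emp, ip, size) in enumerate(traffic):
        if ip != dest_ip:
            continue
        gaps = [
            abs(t - t2)
            for j, (t2, emp2, _, _) in enumerate(traffic)
            if j != i and emp2 == emp
        ]
        gap = min(gaps) if gaps else None
        if gap is None or gap >= min_gap:
            hits.append((t, emp, size, gap))
    return hits
```

Applied to the two records of employee #18 described above, the access at 17:57 would be reported with a gap of 2 hours 42 minutes to the earlier upload at 15:15.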

 

Figure 4 Trace other abnormal activities

 

 

1.2.4 Conclusion

All of the evidence above characterizes the patterns of behavior of suspicious computer use. All abnormal network accesses are directed to the same IP address and upload a great deal of data. Their timing is also unusual: they usually happen on computers whose users might be in the classified area at the same time, or they are the only network activity during a long period of time.

 

Finally, we tried to use this evidence and the data records to find the suspected employee. As we understand it, many computers are leaking data, but there is only one suspected employee. So this person very likely uses other people’s computers to send data when they are not in the office. We checked every employee to see whether he is in the security area when his machine is leaking data and, if so, excluded him from our suspect list. We finally arrived at employee #27, who matches the characteristics listed above well.
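The exclusion step can be sketched as a simple filter. This sketch follows one possible reading of the procedure, namely that anyone inside the security area at the moment of a leak cannot have sent it; the function name `remaining_suspects` and the data layouts are assumptions for illustration.

```python
def remaining_suspects(employees, classified, leak_times):
    """Drop every employee who was in the security area during any leak.

    employees:  iterable of employee IDs
    classified: dict mapping employee ID -> list of (start, end) times
    leak_times: list of times at which suspicious uploads occurred
    All three layouts are assumed for this sketch.
    """
    suspects = set(employees)
    for leak in leak_times:
        for emp in list(suspects):
            intervals = classified.get(emp, [])
            if any(start <= leak <= end for start, end in intervals):
                suspects.discard(emp)
    return suspects
```

The filter only narrows the list; whoever remains still has to be checked against the other characteristics described in this section.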